test: add reproducer for case-insensitive write rejection (same field ID, different column casing) #562
Open
pandaamit91 wants to merge 3 commits into linkedin:main from
Conversation
… from stored schema

Responds to the reviewer's observation that "writes with different casing already succeed." These tests establish the baseline behavior before any fix is applied. Three scenarios are documented:

1. testPositionalInsert_succeedsRegardlessOfStoredCasing: Positional INSERT (no column list) never needs to resolve column names, so casing differences are irrelevant. Works unconditionally.

2. testExplicitColumnInsert_succeedsWithDefaultCaseSensitivity: INSERT with an explicit lowercase column list (e.g. "id") against a table that stores "ID" succeeds with the Spark default (caseSensitive=false). Spark resolves "id" → "ID" at analysis time, so the server receives the correct casing. This confirms the reviewer's observation.

3. testExplicitColumnInsert_failsWhenCaseSensitiveEnabled: The same explicit-column INSERT fails with an AnalysisException when spark.sql.caseSensitive=true. "id" cannot be resolved against "ID" on the client before the request ever reaches the server.

Together these tests show: writes already work under the default Spark configuration, but fail once caseSensitive=true is in effect — a gap that exists independently of any server-side schema normalization.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
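The three scenarios above can be sketched with a toy resolver. This is not Spark's analyzer — just a minimal model of how a column list resolves (or fails to resolve) against stored casing under the two `spark.sql.caseSensitive` settings; the function and table names are illustrative assumptions.

```python
# Toy model of column-name resolution against a stored schema.
# NOT Spark code -- a sketch of the case-(in)sensitive matching the tests exercise.

def resolve(requested, stored, case_sensitive):
    """Map each requested column name to the stored casing, or raise."""
    resolved = []
    for name in requested:
        matches = [s for s in stored
                   if (s == name if case_sensitive
                       else s.lower() == name.lower())]
        if not matches:
            raise ValueError(f"cannot resolve column '{name}' against {stored}")
        resolved.append(matches[0])
    return resolved

stored = ["ID", "name"]

# Scenario 1: positional INSERT supplies no column list, so resolution
# never runs at all -- values are matched by position, casing irrelevant.

# Scenario 2: case-insensitive resolution maps "id" -> stored "ID",
# so the server receives the correct casing.
print(resolve(["id"], stored, case_sensitive=False))  # ['ID']

# Scenario 3: case-sensitive resolution finds no exact match for "id",
# failing on the client before any request reaches the server.
try:
    resolve(["id"], stored, case_sensitive=True)
except ValueError as e:
    print(e)
```

The sketch makes the asymmetry explicit: the failure in scenario 3 is purely client-side name resolution, which is why it occurs regardless of any server-side schema handling.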
… append

Add testDataFrameWriteTo_failsWhenCaseSensitiveEnabled to complete the characterization of existing write behavior. With caseSensitive=true, Spark cannot resolve lowercase "test" or ALL-CAPS "TEST" against stored "TeSt", so both writeTo().append() variants throw AnalysisException before reaching the server — documenting the gap that exists regardless of any server-side normalization fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… Spark 3.1

Spark 3.1's ResolveInsertInto rule matches INSERT column list names case-sensitively regardless of spark.sql.caseSensitive. Rename testExplicitColumnInsert_succeedsWithDefaultCaseSensitivity to testExplicitColumnInsert_failsEvenWithDefaultCaseSensitivity and flip its assertion to assertThrows, matching the actual observed behavior. Update class-level Javadoc to reflect the corrected findings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
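The Spark 3.1 quirk this commit documents can be modeled as a resolver that simply ignores the case-sensitivity flag for INSERT column lists. Again a hedged toy sketch, not Spark internals — the function name and behavior of silently dropping unresolved names mirror what the tests observed, not Spark's actual implementation:

```python
# Sketch of the observed Spark 3.1 ResolveInsertInto behavior (toy model):
# INSERT column-list names are matched by exact string equality,
# and spark.sql.caseSensitive is never consulted.

def resolve_insert_columns(requested, stored, case_sensitive):
    # `case_sensitive` is accepted but deliberately unused, mirroring
    # the observed behavior for INSERT column lists on Spark 3.1.
    return [name for name in requested if name in stored]

stored = ["ID"]

# Even with caseSensitive=false, lowercase "id" finds no exact match and
# is silently dropped -- which later surfaces downstream as
# "AnalysisException: not enough data columns".
print(resolve_insert_columns(["id"], stored, case_sensitive=False))  # []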
Collaborator
Thanks @pandaamit91, the current state of the world is clear. What is the AI for the Spark client, since mixed case will throw an error before it lands on the OH server?
Summary
We have noticed that OH writes with different column casing already succeed in some cases and we want to validate the existing behavior before applying any fix. This PR does that — it adds characterization tests that document exactly which write paths work today and which ones don't, with an explanation of why.
Key Findings
- Positional writes (no explicit column list, including df.writeTo().append()) never resolve column names, so the commit carries the unchanged existing schema — writeSchema.sameSchema(tableSchema) is true and the server's validateWriteSchema is never invoked. DaliSpark (a wrapper over df.writeTo()) gets this for free.
- Explicit-column INSERT fails on Spark 3.1 even with the default configuration: ResolveInsertInto matches INSERT column-list names case-sensitively (spark.sql.caseSensitive governs column references in SELECT expressions, not INSERT column lists). The unresolved column is silently dropped, causing AnalysisException: not enough data columns.
- The Spark client provides no case-insensitive resolution layer, so the server is the only place to normalize.
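If normalization has to live on the server, one plausible shape is rewriting the incoming write schema so each field adopts the stored schema's casing, matched case-insensitively. The sketch below is an assumption about how such a fix could look — the function name, input shape, and error handling are illustrative, not OpenHouse's actual API:

```python
# Hedged sketch of server-side schema normalization (names are assumed,
# not OpenHouse code): map each incoming column name to the stored
# schema's canonical casing via a case-insensitive lookup.

def normalize_write_schema(write_fields, stored_fields):
    """Return write_fields rewritten to the stored casing."""
    stored_by_lower = {f.lower(): f for f in stored_fields}
    normalized = []
    for f in write_fields:
        match = stored_by_lower.get(f.lower())
        if match is None:
            raise ValueError(f"unknown column '{f}' not in {stored_fields}")
        normalized.append(match)
    return normalized

# Incoming casing differs from stored casing; normalization rewrites it.
print(normalize_write_schema(["id", "NAME"], ["ID", "name"]))  # ['ID', 'name']
```

With field-ID-based formats the same idea applies per field ID rather than per name; the case-insensitive name match here is the simplest stand-in for that mapping.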
Changes
Testing Done
- Manually tested on local docker setup. Please include the commands run and their output.
- Added new tests for the changes made.
- Updated existing tests to reflect the changes made.
- No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- Some other form of testing, like staging or soak time in production. Please explain.
Test: Adds CaseInsensitiveWriteTest — a mock-based Spark e2e characterization test that establishes a baseline of which write paths already handle case-mismatched column names before any server-side fix is applied.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.